This lab focuses on SMS message spam detection using decision tree and random forest classification. It's a direct counterpart to the rule-based spam detection from Lab 05. At the end of the lab, you should be able to use scikit-learn
to:
Let's start by importing the packages we'll need. As usual, we'll import pandas
for exploratory analysis, but this week we're also going to use the tree
subpackage from scikit-learn
to create decision tree models and the ensemble
subpackage to create random forest models. We'll also use the dummy
package to build a baseline model from we which can gauge how good our final model is, just like we did in Lab 06.
In [ ]:
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
Next, let's load the data. Write the path to your sms.csv file in the cell below:
In [ ]:
data_file = 'data/sms.csv'
Execute the cell below to load the CSV data into a pandas data frame with the columns label
and message
.
Note: This week, the CSV file is not comma separated, but instead tab separated. We can tell
pandas
about the different format using thesep
argument, as shown in the cell below. For more information, see theread_csv
documentation.
In [ ]:
sms = pd.read_csv(data_file, sep='\t', header=None, names=['label', 'message'])
sms.head()
Finally, let's select our feature matrix $X$ and target variable $y$ from the data:
In [ ]:
X = sms['message']
y = sms['label']
Before we build the decision tree model, let's build a dummy model, i.e. a naive model that predicts new values using a simple strategy. Dummy models are usually not good predictors (we usually won't use them to solve real problems), but are useful because they provide a baseline accuracy measurement for the data set, from which we can gauge the accuracy of any further models we build. In Lab 06, we built a dummy regression model as a baseline for our linear regression model.
scikit-learn
provides dummy model functionality via the dummy
subpackage. This subpackage provides both dummy regression and classification algorithms, which can be customised with different strategies. We can use the DummyClassifier
class to build a dummy classification model. DummyClassifier
supports five different strategies for predicting values:
strategy='stratified'
: Predict new values randomly, but in proportion to their frequency in the training set.strategy='most_frequent'
: Predict new values as the most frequently occurring target variable in the training set.strategy='prior'
: Predict new values as the most frequently occurring target variable in the training set. This is essentially the same as strategy='most_frequent'
, but returns different values when the predict_proba
method is called.strategy='uniform'
: Predict new values randomly, with equal probability.strategy='constant'
: Predict new values as some constant value (the constant value must also be specified using the constant
keyword argument).Our data is in the form of raw text. This was fine when we were using the custom rule-based model in Lab 05, but this was a bit of a cheat because the model was written specially for that lab. In general, algorithms can only consume numerical values, and so we'll need to transform the data into a suitable representation. One popular way to do this with text data is to compute the TF-IDF score for each word in our dataset, just as we did in Lab 04b.
The TF-IDF score is composed of the term frequency (TF) and inverse document frequency (IDF) measures:
Typically, term frequency and inverse document frequency are combined as a single metric, term frequency-inverse document frequency (TF-IDF), which is simply the multiple of the individual values. Consequently, if a term has a high TF-IDF score, its presence across a set of documents (e.g. SMS messages) is low, while its number of occurrences in a given document (e.g. a candidate SMS message under evaluation) is high. If a term has a low TF-IDF score, this is an indicator that it doesn't appear very frequently in a given document, occurs very frequently across the set of documents, or both. We can exploit this information to find terms that can distinguish a certain set of documents (e.g. spam) from a larger set of documents.
In Lab 04b, we used the TfidfVectorizer
class to compute TF-IDF scores for the words in the messages in our SMS dataset. Now, we need to use its numerical output as the input to our dummy model. While we could save the output to a temporary variable, and feed this to the input of our dummy model, there is a convenient alternative.
In scikit-learn
, pipelines represent chains of transformation operations ending in an estimator (i.e. a classifier or regressor). Once a pipeline has been created, we can treat it just like a model, calling the fit
method to fit our data and the predict
method to predict new values. The general structure of a pipeline looks like this:
+----------------+ +---------------+ +-----------+
| Transformer #1 +---->+ Transfomer #2 +---->...---->+ Estimator |
+----------------+ +---------------+ +-----------+
In this case, we want to build a pipeline consisting of a TD-IDF scorer followed by a dummy model. We can do this using the make_pipeline
function from the pipeline
subpackage of scikit-learn
, as follows:
In [ ]:
pipeline = make_pipeline(
TfidfVectorizer(stop_words='english'), # Remove English stop words before computing TF-IDF
DummyClassifier(strategy='stratified', random_state=0) # Use a stratified prediction strategy
)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0) # Stratified 5 fold cross validation
y_pred = cross_val_predict(pipeline, X, y, cv=outer_cv) # Make predictions via cross validation
print(classification_report(y, y_pred)) # Print the classification report
As can be seen, the dummy model results are poor - much worse than our rule-based model from Lab 05:
Still, the dummy model is useful as a "worst case" baseline to which we can compare our further analysis.
Let's start by building a decision tree classification model of the SMS message data. scikit-learn
supports decision tree functionality via the tree
subpackage. This subpackage supports both decision tree regression and classification. We can use the DecisionTreeClassifier
class to build our model.
DecisionTreeClassifier
accepts a number of different hyperparameters and the model we build may be more or less accurate depending on their values. We can get a list of these modelling parameters using the get_params
method of the estimator (this works on any scikit-learn
estimator), like this:
In [ ]:
DecisionTreeClassifier().get_params()
You can find a more detailed description of each parameter in the scikit-learn
documentation.
Let's use a grid search to select the optimal decision tree classification model from a set of candidates. First, we need to build a pipeline, just as we did above. Next, we define the parameter grid. Finally, we use a grid search to select the best model via an inner cross validation and an outer cross validation to measure the accuracy of the selected model.
Note: When using grid search with pipelines, we have to adjust the names of our hyperparameters, prepending the name of the class they apply to (in lowercase). This is so that
scikit-learn
can distinguish which hyperparameters apply to what classes. Below, we prepend the string'decisiontreeclassifier__'
to each hyperparameter name because they all apply to theDecisionTreeClassifier
class.
In [ ]:
pipeline = make_pipeline(
TfidfVectorizer(stop_words='english'),
DecisionTreeClassifier(random_state=0) # Control randomness, so the lab works the same for everyone
)
# Build models for different values of criterion, min_samples_leaf and min_samples_split
parameters = {
'decisiontreeclassifier__criterion': ['gini', 'entropy'],
'decisiontreeclassifier__min_samples_leaf': [1, 10, 20],
'decisiontreeclassifier__min_samples_split': [2, 10, 20] # Min value is 2
}
# Use inner CV to select the best model
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0) # K = 5
clf = GridSearchCV(pipeline, parameters, cv=inner_cv, n_jobs=-1) # n_jobs=-1 uses all available CPUs = faster
clf.fit(X, y)
# Use outer CV to evaluate the error of the best model
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0) # K = 10, doesn't have to be the same
y_pred = cross_val_predict(clf, X, y, cv=outer_cv)
print(classification_report(y, y_pred)) # Print the classification report
The model is much more accurate than both the dummy model above and the rule-based model from lab 05. Specifically, we can say that:
This is a significant increase in spam detection ability over both the dummy model and the rule-based model from Lab 05! Further improvements can be made by expanding the ranges of parameter grid values or introducing further hyperparameters (e.g. other stopping criteria).
After fitting a pipeline object, we can check the parameters of the final model as follows:
In [ ]:
clf.best_params_
It's also possible to plot a flow chart representation of a decision tree, although this is beyond the scope of this lab.
Decision trees are useful, but tend to create models that are highly sensitive to slight changes in the values of their input features. One solution to this drawback is the use of random forests, which tend to blur the decision boundaries in regions where there is uncertainty as to which class is the correct choice.
Let's build a random forest classification model of the SMS message data to see if we can improve over the simple decision tree model. scikit-learn
supports ensemble model functionality via the ensemble
subpackage. This subpackage supports random forest regression and classification, as well as several other ensemble modelling algorithms. We can use the RandomForestClassifier
class to build our model.
Just like earlier, we can list the hyperparameters of the algorithm by calling the get_params
method on it:
In [ ]:
RandomForestClassifier().get_params()
As can be seen, we can choose to specify the same hyperparameters as with decision tree classifiers (plus a few more - see the official documentation for more detail). But let's keep it simple and just specify the number of subtrees to include in the forest (feel free to adjust or add to the grid search parameters, if you want to). We can do this using the n_estimators
attribute, as follows:
In [ ]:
pipeline = make_pipeline(
TfidfVectorizer(stop_words='english'),
RandomForestClassifier(random_state=0) # Control randomness, so the lab works the same for everyone
)
# Build models for different numbers of estimators
parameters = {
'randomforestclassifier__n_estimators': [5, 10, 25, 50]
}
# Use inner CV to select the best model
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0) # K = 5
clf = GridSearchCV(pipeline, parameters, cv=inner_cv, n_jobs=-1) # n_jobs=-1 uses all available CPUs = faster
clf.fit(X, y)
# Use outer CV to evaluate the error of the best model
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0) # K = 10, doesn't have to be the same
y_pred = cross_val_predict(clf, X, y, cv=outer_cv)
print(classification_report(y, y_pred)) # Print the classification report
As can be seen, the random forest model is the most accurate spam detector that we've built so far:
Further improvements can be made by expanding the ranges of parameter grid values or introducing further hyperparameters (e.g. impurity measures, stopping criteria).